On Staggered Checkpointing
Abstract
A consistent checkpointing algorithm saves a consistent view of a distributed application's state on stable storage. The traditional consistent checkpointing algorithms require different processes to save their state at about the same time. This causes contention for the stable storage, potentially resulting in large overheads. Staggering the checkpoints taken by various processes can reduce checkpoint overhead [10]. This paper presents a simple approach to arbitrarily stagger the checkpoints. Our approach requires that the processes take consistent logical checkpoints, as compared to consistent physical checkpoints enforced by existing algorithms. Experimental results on nCube-2 are presented.
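The contention argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm. It assumes, purely for illustration, that every process writes a checkpoint of the same size, that the stable storage has a fixed bandwidth shared equally among concurrent writers, and that a process is stalled while its checkpoint is being written. All function and parameter names are hypothetical.

```python
def simultaneous_stall(num_procs, ckpt_size, bandwidth):
    """Per-process write time when all processes checkpoint at once.

    Assumption: the storage bandwidth is split equally among the
    num_procs concurrent writers.
    """
    per_writer_bandwidth = bandwidth / num_procs
    return [ckpt_size / per_writer_bandwidth for _ in range(num_procs)]


def staggered_stall(num_procs, ckpt_size, bandwidth):
    """Per-process write time when no two checkpoint writes overlap.

    Assumption: each writer gets the full storage bandwidth for its slot.
    """
    return [ckpt_size / bandwidth for _ in range(num_procs)]


def staggered_start_times(num_procs, ckpt_size, bandwidth):
    """Start offsets that place each write in its own back-to-back slot."""
    slot = ckpt_size / bandwidth
    return [i * slot for i in range(num_procs)]


if __name__ == "__main__":
    n, size, bw = 4, 8.0, 2.0  # hypothetical units, e.g. MB and MB/s
    print("simultaneous:", simultaneous_stall(n, size, bw))
    print("staggered:   ", staggered_stall(n, size, bw))
    print("start times: ", staggered_start_times(n, size, bw))
```

Under these assumptions the storage is busy for the same total time in both cases, but each process is stalled for only its own slot when the writes are staggered, rather than for the whole contended interval.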
Similar Resources
A Scalable Algorithm for Compiler-placed Staggered Checkpointing
To make progress in the face of possible system failures, long-running parallel applications often checkpoint, or save their state, so they can resume execution. Many current checkpointing techniques require user input, impose run-time performance penalties, or result in all processes checkpointing synchronously which leads to network and file system contention, again resulting in significant p...
Staggered Consistent Checkpointing
A consistent checkpointing algorithm saves a consistent view of a distributed application's state on stable storage. The traditional consistent checkpointing algorithms require different processes to save their state at about the same time. This causes contention for the stable storage, potentially resulting in large overheads. Staggering the checkpoints taken by various processes can reduce c...
Some Thoughts on Distributed Recovery (preliminary version)
This report deals with some aspects of distributed recovery. The report is divided into multiple parts, each part introducing a problem and a solution. The intent of this report is to present a medley of preliminary ideas; more detailed treatment may be presented elsewhere. The report deals with the following problems: A single processor failure tolerance scheme based on the distributed recover...
Consistent Logical Checkpointing
A "consistent checkpointing" algorithm saves a consistent view of the distributed system state on stable storage. The loss of computation upon a failure can be bounded by taking consistent checkpoints with adequate frequency. The traditional consistent checkpointing algorithms require the different processes to save their state at about the same time. This causes contention for the stable storag...